feat(gax): implement dynamic channel refreshing on 401 retries #13212

Draft
blakeli0 wants to merge 1 commit into googleapis:main from blakeli0:feat/gax-mwlid-channel-refresh
Conversation

@blakeli0
Contributor

This PR implements dynamic channel refreshing on 401 Unauthenticated retries, gated by the isMwlidEnvironment environment variable. It introduces compile-time type-safe refresh contracts across TransportChannel and ApiCallContext, with debouncing in ChannelPool to prevent connection stampedes.

@blakeli0 force-pushed the feat/gax-mwlid-channel-refresh branch from 4f508a8 to 9e55d01 on May 15, 2026 21:51
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a mechanism to automatically refresh transport channels when an UnauthenticatedException occurs, specifically within environments where the isMwlidEnvironment variable is set. Key changes include adding a refresh method to the TransportChannel interface and ChannelPool implementation, incorporating a 5-second debounce for refreshes, and updating the retry logic to trigger these refreshes. Review feedback highlights a potential bug in the debounce initialization, suggests using constants for magic numbers, recommends caching environment variable lookups to improve performance, and advises using imports instead of fully qualified names for better readability.

private ScheduledFuture<?> resizeFuture = null;

private final Object entryWriteLock = new Object();
private long lastRefreshTimeNanos = 0;
Contributor

medium

Initializing lastRefreshTimeNanos to 0 can lead to the first refresh being skipped if System.nanoTime() returns a value close to zero (which is possible depending on the JVM's arbitrary time origin). Additionally, the 5-second debounce interval should be defined as a constant.

Suggested change
- private long lastRefreshTimeNanos = 0;
+ private static final long REFRESH_DEBOUNCE_THRESHOLD_NANOS = java.util.concurrent.TimeUnit.SECONDS.toNanos(5);
+ private long lastRefreshTimeNanos = System.nanoTime() - REFRESH_DEBOUNCE_THRESHOLD_NANOS;
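The debounce pattern discussed in this thread can be sketched in isolation. This is a minimal illustration, not the PR's ChannelPool code: `DebouncedRefresher`, its injectable `nanoClock`, and the `refreshAction` callback are hypothetical names introduced here so the window can be exercised without waiting five real seconds.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of gax): runs a refresh at most once per debounce window. */
final class DebouncedRefresher {
  static final long REFRESH_DEBOUNCE_THRESHOLD_NANOS = TimeUnit.SECONDS.toNanos(5);

  private final Object lock = new Object();
  private final LongSupplier nanoClock; // assumption: injected so tests can control time
  private final Runnable refreshAction; // assumption: stands in for the real channel refresh
  private long lastRefreshTimeNanos;

  DebouncedRefresher(LongSupplier nanoClock, Runnable refreshAction) {
    this.nanoClock = nanoClock;
    this.refreshAction = refreshAction;
    // Start one full window in the past so the first refresh is never debounced,
    // whatever origin the clock happens to use.
    this.lastRefreshTimeNanos = nanoClock.getAsLong() - REFRESH_DEBOUNCE_THRESHOLD_NANOS;
  }

  /** Returns true if the refresh ran, false if the debounce window suppressed it. */
  boolean refresh() {
    synchronized (lock) {
      long now = nanoClock.getAsLong();
      // Compare elapsed time by subtraction, as System.nanoTime() values may be negative.
      if (now - lastRefreshTimeNanos < REFRESH_DEBOUNCE_THRESHOLD_NANOS) {
        return false;
      }
      lastRefreshTimeNanos = now;
      refreshAction.run();
      return true;
    }
  }
}
```

In production the clock argument would be `System::nanoTime`; passing it in only makes the first-call behavior and the window boundary easy to verify.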

// replaces the list)
synchronized (entryWriteLock) {
  long now = System.nanoTime();
  if (now - lastRefreshTimeNanos < TimeUnit.SECONDS.toNanos(5)) {
Contributor

medium

Use the newly defined constant for the debounce threshold instead of a hardcoded magic number.

Suggested change
- if (now - lastRefreshTimeNanos < TimeUnit.SECONDS.toNanos(5)) {
+ if (now - lastRefreshTimeNanos < REFRESH_DEBOUNCE_THRESHOLD_NANOS) {

*/
@Override
public ApiFuture<ResponseT> submit(RetryingFuture<ResponseT> retryingFuture) {
if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))) {
Contributor

medium

Accessing environment variables via System.getenv on every RPC attempt (inside submit) is inefficient. Consider caching this value in a static final boolean field to avoid repeated lookups and potential performance overhead.
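One way to apply this suggestion is to resolve the flag once at class initialization. The sketch below is illustrative only: `MwlidEnvironment`, `isEnabled`, and `parseFlag` are hypothetical names, and only the variable name `isMwlidEnvironment` and the case-insensitive `"true"` comparison come from the PR.

```java
/** Hypothetical holder (not part of gax): reads the flag once at class initialization. */
final class MwlidEnvironment {
  // Evaluated a single time when the class is initialized, not on every RPC attempt.
  private static final boolean ENABLED = parseFlag(System.getenv("isMwlidEnvironment"));

  private MwlidEnvironment() {}

  static boolean isEnabled() {
    return ENABLED;
  }

  // Same semantics as the check in the PR: case-insensitive "true", and null-safe.
  static boolean parseFlag(String rawValue) {
    return "true".equalsIgnoreCase(rawValue);
  }
}
```

The trade-off is that the value is fixed for the lifetime of the class, so tests that need both settings would have to go through `parseFlag` or inject the boolean rather than change the environment.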

Comment on lines +137 to +141
if (cause instanceof com.google.api.gax.rpc.UnauthenticatedException) {
  RetryingContext context = retryingFuture.getRetryingContext();
  if (context instanceof com.google.api.gax.rpc.ApiCallContext) {
    com.google.api.gax.rpc.TransportChannel transportChannel =
        ((com.google.api.gax.rpc.ApiCallContext) context).getTransportChannel();
Contributor

medium

The use of fully qualified names for UnauthenticatedException, ApiCallContext, and TransportChannel makes the code verbose and harder to read. It is recommended to use imports instead.

Comment on lines +41 to +44
if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))
    && previousThrowable instanceof UnauthenticatedException) {
  return true;
}
Contributor

medium

The logic for checking the isMwlidEnvironment environment variable and the exception type is duplicated across both shouldRetry method overloads. Consider consolidating this into a private helper method and caching the environment variable result to improve maintainability and performance.
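The consolidation suggested here might look like the following. It is a sketch under stated assumptions: `RefreshAwareRetryAlgorithm` and `isRefreshableAuthFailure` are hypothetical names, the nested `UnauthenticatedException` is a stand-in for `com.google.api.gax.rpc.UnauthenticatedException` so the snippet compiles on its own, and the context parameter is typed as `Object` in place of `RetryingContext`.

```java
/** Stand-in for com.google.api.gax.rpc.UnauthenticatedException (assumption, for compilation). */
class UnauthenticatedException extends RuntimeException {}

/** Hypothetical retry algorithm in which both overloads share one private helper. */
final class RefreshAwareRetryAlgorithm<ResponseT> {
  private final boolean mwlidEnabled; // resolved once by the caller, e.g. from the env var

  RefreshAwareRetryAlgorithm(boolean mwlidEnabled) {
    this.mwlidEnabled = mwlidEnabled;
  }

  boolean shouldRetry(Throwable previousThrowable, ResponseT previousResponse) {
    if (isRefreshableAuthFailure(previousThrowable)) {
      return true;
    }
    return false; // placeholder for the pre-existing retry decision
  }

  // Object stands in for RetryingContext here.
  boolean shouldRetry(Object context, Throwable previousThrowable, ResponseT previousResponse) {
    if (isRefreshableAuthFailure(previousThrowable)) {
      return true;
    }
    return false; // placeholder for the pre-existing retry decision
  }

  // The single place where the flag and the exception type are checked.
  private boolean isRefreshableAuthFailure(Throwable throwable) {
    return mwlidEnabled && throwable instanceof UnauthenticatedException;
  }
}
```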

@blakeli0 blakeli0 force-pushed the feat/gax-mwlid-channel-refresh branch from 9e55d01 to 188158f Compare May 15, 2026 21:58
  lastAttemptResult.get();
} catch (java.util.concurrent.ExecutionException e) {
  Throwable cause = e.getCause();
  if (cause instanceof com.google.api.gax.rpc.UnauthenticatedException) {
Contributor

@vverman vverman May 18, 2026

Having com.google.api.gax.rpc classes (e.g. ApiCallContext, TransportChannel) inside ScheduledRetryingExecutor creates a circular package dependency with the com.google.api.gax.retrying package.

We could define an MtlsRotationHandler within gax.retrying that is obtained from the RetryingContext. The executor would then interact only with the MtlsRotationHandler.
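A rough sketch of that dependency inversion follows. Only the name `MtlsRotationHandler` comes from the comment above; the method `onUnauthenticated`, the accessor `getMtlsRotationHandler`, and the stand-in types `RotationAwareContext`, `ExecutorSketch`, and `ChannelRefreshingHandler` are assumptions made for illustration.

```java
/** Would live in gax.retrying under this proposal; it references no gax.rpc types. */
interface MtlsRotationHandler {
  void onUnauthenticated(Throwable cause); // hypothetical method name
}

/** Stand-in for RetryingContext, extended with a hypothetical accessor. */
interface RotationAwareContext {
  MtlsRotationHandler getMtlsRotationHandler(); // may return null when no handler is set
}

/** Stand-in for the executor side: depends only on the two interfaces above. */
final class ExecutorSketch {
  private ExecutorSketch() {}

  static void handleFailure(RotationAwareContext context, Throwable cause) {
    MtlsRotationHandler handler = context.getMtlsRotationHandler();
    if (handler != null) {
      handler.onUnauthenticated(cause);
    }
  }
}

/** Stand-in for the rpc side: adapts a channel refresh to the handler interface. */
final class ChannelRefreshingHandler implements MtlsRotationHandler {
  private final Runnable refreshChannel; // assumption: wraps the transport channel refresh

  ChannelRefreshingHandler(Runnable refreshChannel) {
    this.refreshChannel = refreshChannel;
  }

  @Override
  public void onUnauthenticated(Throwable cause) {
    refreshChannel.run();
  }
}
```

With this shape the retrying package never names ApiCallContext or TransportChannel; the rpc package supplies the handler implementation and the dependency points in one direction only.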

Contributor

@vverman vverman left a comment

In general I think this design is elegant and handles the unary approach well. Changes might be needed to accommodate cert mismatches and avoid circular dependencies.

The only open concern is streaming requests, for which we cannot leverage the RetryExecutor. Since we don't want to retry a failed stream request and instead just refresh the channel, IIUC we would be restricted to using a per-call interceptor.

public boolean shouldRetry(
    RetryingContext context, Throwable previousThrowable, ResponseT previousResponse) {
  if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))
      && previousThrowable instanceof UnauthenticatedException) {
Contributor

nit: This checks only for UNAUTHENTICATED. Instead, we should check for a cert mismatch, which could be done by passing a parameter with the context. However, it is key that we avoid redundant disk reads, since many requests could fail simultaneously.

// replaces the list)
synchronized (entryWriteLock) {
  long now = System.nanoTime();
  if (now - lastRefreshTimeNanos < TimeUnit.SECONDS.toNanos(5)) {
Contributor

nit: One potential concern here is that the certs could rotate within the 5-second debounce window, so a request fails due to a cert mismatch while the refresh is still being suppressed. This could lead to valid requests failing.

I think we can build a workaround by using the cert-mismatch as a trigger.
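One possible reading of "cert mismatch as a trigger" is to refresh only when the certificate material differs from what the current channel was built with, rather than on a timer. The sketch below is an assumption about how that could work, not something the PR specifies: `CertMismatchTrigger`, `certSource`, and `refreshChannel` are hypothetical names, and the fingerprint comparison uses a plain SHA-256 digest from the JDK.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.function.Supplier;

/** Hypothetical trigger (not part of gax): refreshes only when the certificate bytes change. */
final class CertMismatchTrigger {
  private final Supplier<byte[]> certSource; // assumption: e.g. reads the cert file from disk
  private final Runnable refreshChannel;     // assumption: stands in for the channel refresh
  private byte[] activeFingerprint;          // fingerprint the current channel was built with

  CertMismatchTrigger(Supplier<byte[]> certSource, Runnable refreshChannel) {
    this.certSource = certSource;
    this.refreshChannel = refreshChannel;
    this.activeFingerprint = sha256(certSource.get());
  }

  /** Returns true if the certificate changed and a refresh ran, false otherwise. */
  synchronized boolean refreshIfCertChanged() {
    byte[] current = sha256(certSource.get());
    if (Arrays.equals(current, activeFingerprint)) {
      return false; // same cert as the active channel, so a refresh would not help
    }
    activeFingerprint = current;
    refreshChannel.run();
    return true;
  }

  private static byte[] sha256(byte[] data) {
    try {
      return MessageDigest.getInstance("SHA-256").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is required on every Java platform", e);
    }
  }
}
```

This version still consults the source on every call, so it does not by itself address the redundant disk-read concern raised earlier in the review; a short read-rate limit or a per-request fingerprint check would need to be layered on top.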

2 participants